This section contains some basic FAQ and tips. It’s here at the top so that if you get stuck or confused, you can easily find it.
Use help(functionname) or ?functionname, or search for the function by name in the Help tab on the right. Function arguments have names, but the names can be omitted if the arguments are given in their intended order; argument names and meanings can be looked up in the help files.
Tools -> Global Options -> R Markdown -> untick “Show plots inline…”
This indicates the package is not loaded. Use the relevant library() command to load the package that includes the missing function. There are library("package") calls at the beginning of each section that requires them. You really only need to load a package once per session, but the calls are repeated anyway to keep the script modular and easier to revisit. In general, it’s better practice to have all library() calls at the head of the script file.
Either the package is not installed, or you misspelled its name. You should have installed the necessary packages before the start of the workshop. If you did not (indicated by library() giving you a “package not found” error), then here are the relevant installation commands.
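A sketch of those installation commands, covering the packages loaded via library() later in this document (run once; an internet connection is required):

```r
# install the packages used in this workshop
pkgs = c("languageR", "ggplot2", "ggmosaic", "dplyr", "quanteda", "stringdist",
         "reshape2", "corrplot", "rworldmap", "magrittr", "wordcloud", "cowplot")
install.packages(pkgs)
```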
# This is a code block, distinguishable by the gray shaded background.
# This is a line of code:
print( "Hello! Put your text cursor on this line (click on the line). Anywhere on the line. Now press CTRL+ENTER (PC) or CMD+ENTER (Mac). Just do it." )
# The command above, when executed (what you just did), printed the text in the console below. Also, this here is a comment. Commented parts of the script (anything after a # ) are not executed. This R Markdown file has both code blocks (gray background) and regular text (white background).
(Also, if you’ve been scrolling left and right in the script window to read the code, turn on text wrapping ASAP: on the menu bar above, go to Tools -> Global Options -> Code (tab on the left) -> tick “Soft-wrap R source files”)
So, print() is a function. Most functions look something like this:
myfunction(inputs, parameters)
All the inputs to the function go inside the ( ) brackets, separated by commas. In the above case, the text is the input to the print() function. All text, or “strings”, must be within quotes. Most functions have some output. Note that commands may be nested; in this case, the innermost are evaluated first:
function2( function1(do, something), parameters_for_function2 )
Don’t worry if that’s all a bit confusing for now. Let’s try another function, sum():
sum(1,10) # cursor on the line, press CTRL+ENTER (or CMD+ENTER on Mac)
# You should see the output (sum of 1 and 10) in the console.
# Important: you can always get help for a function and check its input parameters by executing
help(sum) # put the name of any function in the brackets
# ...or by searching for the function by name in the Help tab on the right.
# Exercise. You can also write commands directly in the console, executing them with ENTER. Try some more simple maths - math in R can be written using regular math symbols (which are really also functions). Write 2*3+1 in the console below, and press ENTER.
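Incidentally, those math symbols really are functions under the hood: with backticks you can call them in regular function notation. A quick sketch (just a curiosity, you won’t need this often):

```r
2*3+1              # operator notation
`+`(`*`(2, 3), 1)  # the very same computation, written as nested function calls
```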
# Let's plot something. The command for plotting is, surprisingly, plot().
# It (often) automatically adapts to the data type (you'll see how soon enough).
plot(42, main = "The greatest plot in the world") # execute the command; a plot should appear on the right.
# OK, that was not very exciting. But notice that a function can have multiple inputs, or arguments. In this case, the first argument is the data (a vector of length one), and the second is 'main', which specifies the main title of the plot.
# You can make the plot bigger by pressing the 'Zoom' button above the plot panel on the right.
# Let's create some data to play with. We'll use the sample() command, which draws random numbers from a predefined sample. Basically it's like rolling a die n times and recording the results.
sample(x = 1:6, size = 50, replace = T) # execute this; its output is 50 numbers
# Most functions follow this pattern: there's input(s), something is done to the input, and then there's an output. If the output is not assigned to some object, it usually just gets printed in the console. It would be easier to work with the data if we saved it in an object. For this, we need to learn assignment, which in R works using the equals = symbol (or <-, but let's stick with = for simplicity).
dice = sample(x = 1:6, size = 50, replace = T) # what it means: dice is the name of a (new) object, the equals sign (=) signifies assignment, with the object on the left and the data on the right. In this case, the data is the output of the sample() function. Instead of being printed in the console, the output is assigned to the object.
dice # execute to inspect: calling an object usually prints its contents into the console below.
# Let's plot:
hist(dice, breaks=20, main="Frequency of dice values") # plots a histogram (distribution of values)
plot(dice) # plots data as it is ordered in the object
xmean = mean(dice) # calculate the mean of the 50 dice throws
abline(h = xmean, lwd=3) # plot the mean as a horizontal line
# Exercise: compare this plot with your neighbor. Do they look the same? Why/why not?
# Exercise: use the sample() function to simulate 25 throws of an 8-sided DnD dice.
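One possible solution to the last exercise (a sketch; the object name is arbitrary):

```r
# simulate 25 throws of an 8-sided die with sample()
dnd = sample(x = 1:8, size = 25, replace = T)
dnd
hist(dnd, breaks = 0:8, main = "25 throws of a d8")
```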
The next sections will go over basic data types and suitable plots. If we have time, we’ll also learn how to make app-like interactive plots.
Numerical values include things we can measure on a continuous scale (height, weight, reaction time), things that can be ordered (“rate this on a scale of 1-5”), and things that have been counted (number of participants in an experiment, number of words in a text).
We will be using the English visual lexical decision and naming reaction time dataset from the languageR package.
library(languageR) # load the necessary package
# To make things easier in the beginning, we'll subset the (rather large) dataset; just run the following line - we'll see how indexing and subsampling works later;
eng = english[ c(1:100, 2001:2100), c(1:5,7)]
# We can inspect the data using convenient R commands.
dim(eng) # dimensions of the data.frame
## [1] 200 6
summary(eng) # produces an automatic summary of the columns
## RTlexdec RTnaming Familiarity Word AgeSubject
## Min. :6.251 Min. :6.076 Min. :1.100 aisle : 1 old :100
## 1st Qu.:6.440 1st Qu.:6.153 1st Qu.:2.900 ant : 1 young:100
## Median :6.548 Median :6.324 Median :3.455 arc : 1
## Mean :6.563 Mean :6.329 Mean :3.648 are : 1
## 3rd Qu.:6.659 3rd Qu.:6.499 3rd Qu.:4.430 arm : 1
## Max. :7.154 Max. :6.648 Max. :6.730 art : 1
## (Other):194
## WrittenFrequency
## Min. : 0.6931
## 1st Qu.: 3.8067
## Median : 4.7832
## Mean : 4.9646
## 3rd Qu.: 6.0742
## Max. :11.2797
##
head(eng) # prints the first rows
## RTlexdec RTnaming Familiarity Word AgeSubject WrittenFrequency
## 1 6.543754 6.145044 2.37 doe young 3.912023
## 2 6.397596 6.246882 4.43 whore young 4.521789
## 3 6.304942 6.143756 5.60 stress young 6.505784
## 4 6.424221 6.131878 3.87 pork young 5.017280
## 5 6.450597 6.198479 3.93 plug young 4.890349
## 6 6.531970 6.167726 3.27 prop young 4.770685
# In RStudio, you can also have a look at the dataframe by clicking on the little "table" icon next to it in the Environment section (top right).
help(english) # built in datasets often have help files attached
eng$Familiarity # the $ is used for accessing (named) column of a dataframe (or elements in a list)
## [1] 2.37 4.43 5.60 3.87 3.93 3.27 3.73 5.67 3.10 4.43 3.27 6.73 1.72 3.30
## [15] 2.23 4.33 3.80 4.10 4.66 4.13 3.40 1.97 1.44 3.28 3.67 2.83 4.13 1.93
## [29] 3.03 3.90 6.03 3.73 5.23 5.27 4.63 3.50 3.63 2.03 5.57 6.17 4.47 4.40
## [43] 3.47 4.77 1.31 3.63 3.83 4.17 1.43 3.03 3.47 3.37 2.53 4.63 5.90 2.13
## [57] 2.73 2.60 4.73 3.13 2.90 1.90 3.40 3.23 4.30 4.53 5.97 3.60 3.93 2.67
## [71] 6.07 2.47 3.27 2.52 3.83 5.80 3.59 5.37 3.40 5.67 5.27 3.87 4.53 2.80
## [85] 4.30 3.70 3.13 2.80 5.50 6.52 3.77 2.40 2.56 4.10 4.73 4.13 3.17 2.60
## [99] 4.07 2.97 2.70 4.93 5.70 4.60 2.90 1.30 2.30 2.93 1.63 4.67 4.03 3.43
## [113] 3.53 3.33 2.67 3.47 2.70 5.00 4.66 1.33 3.40 2.90 2.57 3.07 3.93 2.70
## [127] 3.30 5.23 1.67 4.07 1.73 3.00 3.43 5.47 3.27 4.07 2.90 1.97 3.37 5.70
## [141] 4.43 5.47 2.33 3.00 4.13 3.10 3.27 5.13 2.73 4.72 1.83 5.67 3.38 4.67
## [155] 4.17 5.87 3.53 5.67 1.63 3.60 2.83 3.50 2.80 3.43 4.67 3.37 1.10 3.38
## [169] 2.30 3.27 2.93 3.13 3.30 4.20 6.17 3.20 3.77 1.72 4.83 2.90 2.09 5.40
## [183] 4.47 3.43 3.03 2.93 3.30 4.07 1.90 3.20 3.20 4.37 4.73 3.41 4.27 3.44
## [197] 4.03 4.20 3.13 1.23
eng[, "Familiarity"] # this is the other indexing notation: [row, column]
## [1] 2.37 4.43 5.60 3.87 3.93 3.27 3.73 5.67 3.10 4.43 3.27 6.73 1.72 3.30
## [15] 2.23 4.33 3.80 4.10 4.66 4.13 3.40 1.97 1.44 3.28 3.67 2.83 4.13 1.93
## [29] 3.03 3.90 6.03 3.73 5.23 5.27 4.63 3.50 3.63 2.03 5.57 6.17 4.47 4.40
## [43] 3.47 4.77 1.31 3.63 3.83 4.17 1.43 3.03 3.47 3.37 2.53 4.63 5.90 2.13
## [57] 2.73 2.60 4.73 3.13 2.90 1.90 3.40 3.23 4.30 4.53 5.97 3.60 3.93 2.67
## [71] 6.07 2.47 3.27 2.52 3.83 5.80 3.59 5.37 3.40 5.67 5.27 3.87 4.53 2.80
## [85] 4.30 3.70 3.13 2.80 5.50 6.52 3.77 2.40 2.56 4.10 4.73 4.13 3.17 2.60
## [99] 4.07 2.97 2.70 4.93 5.70 4.60 2.90 1.30 2.30 2.93 1.63 4.67 4.03 3.43
## [113] 3.53 3.33 2.67 3.47 2.70 5.00 4.66 1.33 3.40 2.90 2.57 3.07 3.93 2.70
## [127] 3.30 5.23 1.67 4.07 1.73 3.00 3.43 5.47 3.27 4.07 2.90 1.97 3.37 5.70
## [141] 4.43 5.47 2.33 3.00 4.13 3.10 3.27 5.13 2.73 4.72 1.83 5.67 3.38 4.67
## [155] 4.17 5.87 3.53 5.67 1.63 3.60 2.83 3.50 2.80 3.43 4.67 3.37 1.10 3.38
## [169] 2.30 3.27 2.93 3.13 3.30 4.20 6.17 3.20 3.77 1.72 4.83 2.90 2.09 5.40
## [183] 4.47 3.43 3.03 2.93 3.30 4.07 1.90 3.20 3.20 4.37 4.73 3.41 4.27 3.44
## [197] 4.03 4.20 3.13 1.23
# Plotting time! Let's explore for example the "familiarity" score distribution
plot(eng$Familiarity) # the x-axis is just the index, i.e. the order in which the values appear in the dataframe
hist(eng$Familiarity, breaks=10) # a histogram shows the distribution of values ('breaks' change resolution)
boxplot(eng$Familiarity) # a boxplot is like a visual summary()
stripchart(eng$Familiarity, vertical=T, add=T) # points could be added with points() or stripchart(add=T)
Exercises: let’s practise modifying function parameters by fiddling a bit with this plot (and let’s see if we can improve the rather basic default look). Make a new code block here for the exercises (click the green Insert button on the toolbar above), and copy the boxplot and stripchart lines from above.
- Use the text() function to add the labels summary() would give you to the plot; for example, make it say “median” next to the boxplot, where the median line is (i.e., at coordinates x=1.3 and y=3.455). This could be achieved with either a single text() call, or with multiple individual ones, one for each statistic. Preferably do this programmatically, rather than copy-pasting values from summary() by hand.
- Use text() to add the words themselves (the Word column) to the plot. Note that there are quite a few words, so they’re likely to obscure each other - change the text size using the cex parameter, or better yet, only add a subset of the words, either using sample(), or a sequence of indices (e.g. seq(1,100,3)).
# Another way to plot boxplots, grouping them by some relevant variable:
boxplot(eng$RTnaming ~ eng$AgeSubject, main="Reaction time by age") # note the ~ notation
grid(col=rgb(0,0,0,0.3)) # why not add a grid for reference
# A slightly nicer version:
boxplot(eng$RTnaming ~ eng$AgeSubject, main="Reaction time by age", ylab="Reaction time",
border=c("brown", "forestgreen"), boxwex=0.7, cex=0.4)
abline(h=seq(6,7,0.1), col=rgb(0,0,0,0.1)) # adds horizontal lines instead of full grid
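For the text() labeling exercise above, here is a possible programmatic sketch; it uses a stand-in vector so the lines run on their own - substitute eng$Familiarity to solve the actual exercise:

```r
# label a boxplot with its summary() statistics, programmatically
vals = c(1.1, 2.9, 3.5, 4.4, 6.7, 3.6, 2.5, 5.0)  # stand-in data
boxplot(vals)
stats = summary(vals)  # a named vector: Min., 1st Qu., Median, Mean, 3rd Qu., Max.
text(x = 1.3, y = as.numeric(stats), labels = names(stats), cex = 0.8)
```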
The rgb(red, green, blue, alpha) function allows making custom colors; alpha controls transparency. Possible values range between 0 and 1 by default. Below is a piece of code that generates an example of how the color scheme works (don’t worry if you don’t understand the actual code, this is above the level of this workshop; just put the cursor in the code block and press CTRL+SHIFT+ENTER, or CMD+SHIFT+ENTER on Mac).
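A minimal sketch of such a demo (an approximation; the mixes shown are just examples):

```r
# each bar gets a color mixed from red, green, blue (and alpha) values in [0,1]
barplot(rep(1, 5),
        col = c(rgb(1,0,0,1), rgb(0,1,0,1), rgb(0,0,1,1),
                rgb(1,0,0,0.3), rgb(0.5,0,0.5,1)),
        names.arg = c("red", "green", "blue", "faint red", "purple"),
        main = "rgb() examples")
```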
Another good way to use colors is to use ready-made palettes.
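For example, base R (version 3.6 and newer) ships a collection of palettes through hcl.colors(); packages like RColorBrewer and viridis provide more. A quick sketch:

```r
hcl.pals()[1:10]  # some of the available palette names
barplot(rep(1, 8), col = hcl.colors(8, "Viridis"), border = NA,
        main = "hcl.colors(8, 'Viridis')")
```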
plot(eng$WrittenFrequency, eng$Familiarity) # scatterplot using base graphics
# This workshop is not focused on actual statistical techniques (maybe another time!), but in case you ever need to plot a regression line*:
plot(WrittenFrequency ~ Familiarity, data=eng, col="black", pch=20)
grid(col=rgb(0,0,0,0.2), lty=1)
# do the regression analysis:
# use the same formula notation as above, and the same data parameter, as the input for lm()
# use the lm(...) as an input to abline()
# abline can handle the output of the lm (linear model) command, extracting the intercept and beta coefficient
# could also adjust the look of abline a bit with: col=rgb(0,0,0,0.3), lwd=3
# *Of course the data is actually more complex (consisting of distinct groups), so a proper regression model should take that into account.
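The steps above, shown on the built-in cars dataset so the sketch runs on its own; for the exercise itself, reuse the formula and data=eng from the block above:

```r
plot(dist ~ speed, data = cars, pch = 20)   # scatterplot with formula notation
fit = lm(dist ~ speed, data = cars)         # regression with the same formula
abline(fit, col = rgb(0,0,0,0.3), lwd = 3)  # abline extracts intercept and slope
coef(fit)                                   # the two coefficients
```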
We have now seen how to visualize some data using R’s basic plotting tools, and picked up some basic R skills on the way. Before diving into various other things like networks and maps, we’ll have a look at an alternative plotting package, ggplot2. It uses a different approach to plotting, and a slightly different syntax. It also comes with default colors and aesthetics which many people find nicer than those of the base plot(). A particularly useful feature of ggplot2 is its extendability (or rather the fact people are eager to extend it), with an ever-growing list of addon-packages on CRAN with an extended selection of themes and more niche visualizations.
library(ggplot2) # load ggplot2
# We're using the same english dataset subset (eng) as in the first section.
ggplot(eng, aes(x=WrittenFrequency, y=Familiarity)) +
geom_point()
# the data are defined in the ggplot command, aes() specifies variables and grouping variables
# the + adds layers, themes and other options
Exercises:
- Coloring and shaping points by some variable can pack more information into a plot: add col=AgeSubject, shape=AgeSubject to the aes() above to see for yourself.
- Try a ready-made palette, e.g. scale_colour_brewer(palette = "Dark2").
- Make a similar scatterplot of WrittenFrequency and RTnaming (reaction time), using AgeSubject as the coloring variable; use geom_smooth(method="lm") to add regression lines (analogous to the abline(lm()) from earlier).

Sometimes you might be dealing with data restricted to a few values, or ordinal scales. Let’s see how plotting these might work. This part uses an artificial dataset of made-up agreement values on statements about language in the workplace.
library(ggplot2)
library(ggmosaic)
set.seed(1); x = sample(1:5, 200, T, prob = c(0.3,0.1,0.1, 0.2, 0.3))
workplace = data.frame(
monolingual = x, # Agree with "Workplaces should be monolingual"
preferfirst = pmax(1, pmin(5, x+sample(-2:2, length(x), T))), # Agree with "I prefer speaking my first language
age = round((x+20)*runif(length(x),1,2.5))
)
dim(workplace)
## [1] 200 3
head(workplace)
## monolingual preferfirst age
## 1 1 1 42
## 2 5 4 32
## 3 5 5 61
## 4 2 1 52
## 5 1 1 51
## 6 3 3 48
# We could look at each question separately:
ggplot(workplace, aes(x=monolingual)) +
geom_bar()
# What if we wanted to compare how responses to these similar questions interact? With two numerical vectors, we could use a scatterplot:
ggplot(workplace, aes(x=monolingual, y=preferfirst, color=age)) +
geom_point(alpha=0.8)
# ...but this is not very useful, is it..?
Exercise. Make this plot better.
- Jitter the points to reduce overplotting: add position=position_jitter(width=0.2, height=0.2) to geom_point().
- Add color=age to the aes() above.
- Label the axes properly, e.g. + xlab('Agree with "Workplaces should be monolingual"'), and similarly for ylab().
- Try + scale_color_distiller(palette = "Spectral") and theme_dark().

Another approach is to treat the values as categorical, and produce a mosaic plot:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggmosaic)
# These are the values we will be plotting (the table is ordered differently, look at it sideways)
xtabs(~ workplace$monolingual + workplace$preferfirst)
## workplace$preferfirst
## workplace$monolingual 1 2 3 4 5
## 1 39 6 6 0 0
## 2 6 6 2 1 0
## 3 4 7 5 2 2
## 4 0 9 9 8 22
## 5 0 0 15 16 35
# Plot:
ggplot(data = workplace %>% mutate_all(as.factor) ) +
geom_mosaic(aes(x = product(monolingual,preferfirst),fill=monolingual), na.rm=TRUE) +
scale_fill_hue(h = c(1, 200)) +
xlab("preferfirst") + ylab("monolingual")
Mosaic plots and heatmaps are sort of similar. Let’s have a look.
library(quanteda) # for tokenization
## Package version: 1.4.3
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
library(ggplot2)
library(stringdist) # to calculate string distance
library(reshape2) # needed to wrangle data into a ggplot2-friendly format
# Heatmaps and similar structures are useful for comparing many things with many other things (e.g. parameter values, co-occurrences, correlations)
# Let's calculate the edit distance of some words
words = tokens(tolower("Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore - While I nodded, nearly napping, suddenly there came a tapping."), remove_punct = T)
s = stringdistmatrix(unique(words[[1]]), useNames = T ) %>% as.matrix() %>% melt()
# plot the heatmap of string distance values:
ggplot(data=s, aes(y=Var1, x=Var2, fill=value)) +
geom_tile(colour = "lightgray") + ylab("") + xlab("") +
theme_minimal()
# Exercises:
# Discuss with a neighbor how to interpret this map.
# The default colour palette is not very contrastive; change it by adding + scale_fill_viridis_c()
# The x-axis labels are hard to read; add this: + theme(axis.text.x=element_text(angle=45, hjust=1))
# Correlation matrices may also be visualized as heatmaps
# Let's find correlations between numeric variables in the eng dataset
corrs = cor(eng[,c(1:3,6)])
# inspect the resulting object
# Larger correlation matrices are hard to grasp, but visualization helps.
library(corrplot) # a little package that uses base graphics
corrplot(corrs)
# ggplot alternative (there's also the ggcorr which has extra options)
ggplot(data = melt(corrs), aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color=NA) +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1),
name="Correlation") +
coord_fixed() +
theme_minimal() + labs(x="",y="")
While a whole subject on its own, we will have a quick look at plotting time series - data reflecting changes in some variable over time.
library(quanteda, quietly = T) # load a corpus management package; we'll also make use of a dataset in it
# let's inspect the data first:
length(data_corpus_inaugural$documents$texts)
## [1] 58
rownames(data_corpus_inaugural$documents)
## [1] "1789-Washington" "1793-Washington" "1797-Adams"
## [4] "1801-Jefferson" "1805-Jefferson" "1809-Madison"
## [7] "1813-Madison" "1817-Monroe" "1821-Monroe"
## [10] "1825-Adams" "1829-Jackson" "1833-Jackson"
## [13] "1837-VanBuren" "1841-Harrison" "1845-Polk"
## [16] "1849-Taylor" "1853-Pierce" "1857-Buchanan"
## [19] "1861-Lincoln" "1865-Lincoln" "1869-Grant"
## [22] "1873-Grant" "1877-Hayes" "1881-Garfield"
## [25] "1885-Cleveland" "1889-Harrison" "1893-Cleveland"
## [28] "1897-McKinley" "1901-McKinley" "1905-Roosevelt"
## [31] "1909-Taft" "1913-Wilson" "1917-Wilson"
## [34] "1921-Harding" "1925-Coolidge" "1929-Hoover"
## [37] "1933-Roosevelt" "1937-Roosevelt" "1941-Roosevelt"
## [40] "1945-Roosevelt" "1949-Truman" "1953-Eisenhower"
## [43] "1957-Eisenhower" "1961-Kennedy" "1965-Johnson"
## [46] "1969-Nixon" "1973-Nixon" "1977-Carter"
## [49] "1981-Reagan" "1985-Reagan" "1989-Bush"
## [52] "1993-Clinton" "1997-Clinton" "2001-Bush"
## [55] "2005-Bush" "2009-Obama" "2013-Obama"
## [58] "2017-Trump"
head(tokens(data_corpus_inaugural$documents$texts[[1]])[[1]],30)
## [1] "Fellow-Citizens" "of" "the"
## [4] "Senate" "and" "of"
## [7] "the" "House" "of"
## [10] "Representatives" ":" "Among"
## [13] "the" "vicissitudes" "incident"
## [16] "to" "life" "no"
## [19] "event" "could" "have"
## [22] "filled" "me" "with"
## [25] "greater" "anxieties" "than"
## [28] "that" "of" "which"
# Exercise. Use summary() on data_corpus_inaugural$documents. Then have a look at speech number 58, and find out who's giving the speech (hint: presidents are recorded in the same dataframe)
# The following line of code will tokenize the US Presidents' inaugural speeches corpus and count the words
nw = data.frame(length=ntoken(tokens(data_corpus_inaugural$documents$texts)),
year=data_corpus_inaugural$documents$Year,
president = data_corpus_inaugural$documents$President )
ggplot(nw, aes(x=year, y=length)) +
theme_minimal() +
geom_point()
Exercises
library(quanteda)
# prepare the data, a tokenized corpus:
tok = tokens_tolower(tokens(data_corpus_inaugural$documents$texts)); names(tok)=data_corpus_inaugural$documents$Year
# inspect the first 10 elements of the first element of the list using tok[[1]][1:10]
# The following lines of code will extract & count mentions of the target words in the US Presidents' inaugural speeches corpus
# This will also serve as an introduction to writing custom functions
# The syntax: functionname = function(inputs/parameters){ function body; end with return() }
# you can specify default values for parameters (e.g. parameter = NULL), so that they can be omitted when calling the function
# To use a function, you have to run its description first, saving it in the environment
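# For instance, a toy function with a default parameter value (a hypothetical
# example just to illustrate the syntax; it is not used below):
rescale = function(x, factor = 10){ # 'factor' has a default, so it may be omitted
  return(x * factor)
}
rescale(2)      # uses the default factor of 10
rescale(2, 100) # overrides the default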
findmentions = function(textlist, targets){
results = data.frame(term=NULL, freq=NULL, year=NULL)
for(i in seq_along(targets)){ # loops over targets
# this applies the grep function to the list of texts to find and count mentions:
freq = sapply(textlist, function(x) length(grep(targets[i], x))/length(x)*1000 )
term = gsub("^(.....).*", "\\1", targets[i]) # use the first 5 characters for a shorthand
# concatenate the results:
results = rbind(results, data.frame(term=rep(term, length(textlist)),
freq,
year=as.numeric(names(textlist))
))
}
return(results)
}
# run the function description above; then try out the command below
# the inputs are:
# textlist, a list of tokenized texts (which we tokenized above)
# targets, a character vector; since these are passed to grep, they may be regular expressions, or just single words
freqs = findmentions(textlist=tok,
targets=c("^(he|him|m[ea]n|boys*|male|mr|sirs*|gentlem[ae]n)$",
"^(she|her|wom[ea]n|girls*|female|mrs|miss|lad(y|ies))$")
)
ggplot(freqs, aes(x=year, y=freq, color=term)) +
geom_line() + geom_point() +
labs(y="Frequency per 1000 tokens")
# Exercises:
# Add theme_minimal() or theme_gray() or theme_dark() for an automatic grid
# Google rcolorbrewer palettes and fiddle with the colors (e.g. scale_color_brewer(palette="Pastel1") )
# Define your own regex and use the findmentions() function again (or just put in a single word, if you don't know regex) and visualize some more comparisons.
# Exercise: if you have time, might as well explore the corpus a bit; use the kwic() function:
kwic(data_corpus_inaugural, "wom[ae]n", valuetype = "regex", window = 3)
##
## [1913-Wilson, 363] noble men and | women |
## [1913-Wilson, 584] the men and | women |
## [1913-Wilson, 1261] men and its | women |
## [1913-Wilson, 1321] if men and | women |
## [1921-Harding, 1683] every man and | woman |
## [1921-Harding, 2555] nation-wide induction of | womanhood |
## [1925-Coolidge, 2931] The men and | women |
## [1925-Coolidge, 4340] intuitive counsel of | womanhood |
## [1929-Hoover, 1060] by men and | women |
## [1929-Hoover, 1210] honest men and | women |
## [1937-Roosevelt, 1632] are men and | women |
## [1937-Roosevelt, 1639] ; men and | women |
## [1937-Roosevelt, 1651] ; men and | women |
## [1941-Roosevelt, 509] individual men and | women |
## [1945-Roosevelt, 102] which men and | women |
## [1981-Reagan, 697] of men and | women |
## [1981-Reagan, 2162] free men and | women |
## [1985-Reagan, 550] when men and | women |
## [1985-Reagan, 1148] working men and | women |
## [1989-Bush, 540] . Men and | women |
## [1989-Bush, 917] the men and | women |
## [1989-Bush, 1158] There are young | women |
## [1989-Bush, 2256] , and the | women |
## [1993-Clinton, 174] of men and | women |
## [1997-Clinton, 323] and dignity to | women |
## [2005-Bush, 346] every man and | woman |
## [2005-Bush, 715] chains or that | women |
## [2005-Bush, 1805] on men and | women |
## [2009-Obama, 583] often men and | women |
## [2009-Obama, 675] these men and | women |
## [2009-Obama, 1018] free men and | women |
## [2009-Obama, 1416] every man, | woman |
## [2009-Obama, 2392] why men and | women |
## [2013-Obama, 1293] brave men and | women |
## [2013-Obama, 1640] those men and | women |
## [2017-Trump, 402] forgotten men and | women |
## [2017-Trump, 1254] great men and | women |
##
## exhibited in more
## and children upon
## and its children
## and children be
## is called under
## into our political
## of this country
## , encouraging education
## of good will
## is to discourage
## of good will
## who have more
## who have cool
## joined together in
## and children will
## who raise our
## . It is
## are free to
## , by sending
## of the world
## who work with
## to be helped
## who will tell
## whose steadfastness and
## . Now,
## on this Earth
## welcome humiliation and
## who look after
## obscure in their
## struggled and sacrificed
## can achieve when
## , and child
## and children of
## in uniform,
## , sung and
## of our country
## of our military
# adjust the window parameter, or adjust your actual RStudio window/pane size, if the kwic's are not lined up nicely in the console
# Let's create some artificial data again. The places are real though.
places = data.frame(
name=c('Arivruaich','Adgestone','Allerthorpe','Annesley Lane End','Atherstone','Acklam','Ailsworth','Acrise','Ardlawhill','Angram'),
lng = c(-6.66245,-1.15953,-0.80909,-1.28971,-1.54642,-0.80555,-0.35292,1.13618,-2.2111,-1.20958),
lat = c(58.06239,50.67265,53.91595,53.07087,52.57722,54.0452,52.57579,51.13863,57.65223,53.93104)
); places$value=places$lat/10*runif(10, 0.95,1.05)
library(rworldmap) # this is new
## Loading required package: sp
## ### Welcome to rworldmap ###
## For a short introduction type : vignette('rworldmap')
library(magrittr)
library(ggplot2)
# So what's in the data?
ggplot(places, aes(y=value, x=name)) +
geom_bar(stat="identity") +
coord_flip()
# Mapping time. We'll fetch a generic map from the rworldmap package
data("countryExData", envir = environment(), package = "rworldmap")
uk = joinCountryData2Map(countryExData, joinCode = "ISO3", nameJoinColumn = "ISO3V10", mapResolution = "low") %>% fortify() %>% subset(id=="United Kingdom")
## 149 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 95 codes from the map weren't represented in your data
## Regions defined for each Polygons
# Let's just plot the map first. coord_fixed makes sure the map stays proportional.
ggplot() +
geom_polygon(data=uk, aes(long, lat, group = group), inherit.aes = F) +
coord_fixed()
# Exercise: specify fill="lightgray" for geom_polygon to get a lighter base map; or use color="black", fill="white" to plot only the outlines.
# add + theme_bw() for a different theme
# remove the useless axis labels with: + theme(axis.title = element_blank())
# We could now put the points on the map. The coordinates in the dataframe could be plotted as a regular scatterplot:
ggplot(places, aes(x=lng, y=lat, color=value)) +
geom_point() +
scale_color_viridis_c(option="C")
# That's not very useful on its own though...
Exercises.
By the way, the plotly package we’ll use soon enough gets along with ggplot2 very nicely, and you can convert plots created using the latter into interactive ones using the ggplotly function. There is also the gganimate package which can be used to create animated plots in the form of GIFs. plotly can do animations as well, but interactive, which we’ll see later.
This would be a good point to introduce magrittr’s pipe operator %>%. It’s super useful! The shortcut in RStudio is CTRL+SHIFT+M (or CMD+SHIFT+M). If you’re familiar with Bash pipes: it’s the same idea. If you’re interested in why the somewhat curious name: https://en.wikipedia.org/wiki/The_Treachery_of_Images
library(magrittr)
# Exercise. Try it out and discuss the results with your neighbor.
1:3
## [1] 1 2 3
sum(1:3)
## [1] 6
x=1:3
sum(x)
## [1] 6
1:3 %>% sum() # same result, and not much difference in spelling it out either
## [1] 6
1:3 %>% sum() %>% rep(times=4) # what does that do?
## [1] 6 6 6 6
# "." can be used as a placeholder if the input is not the first argument, so the above could also be spelled out as:
1:3 %>% sum(.) %>% rep(., times=4) # or
## [1] 6 6 6 6
1:3 %>% sum(.) %>% rep(., 4) # and it's the same as
## [1] 6 6 6 6
rep(sum(1:3), times=4)
## [1] 6 6 6 6
# another example:
c(1,1,1,2) %>% match(x=2, table=.) #
## [1] 4
# something longer (take it apart to see how it works):
"hello" %>% gsub(pattern="h", replacement="H", x=.) %>% paste(., "world")
## [1] "Hello world"
Categorical/nominal/discrete values cannot be put on a continuous scale or ordered, and include things like binary values (student vs non-student) and all sorts of labels (noun, verb, adjective). Words in a text could be viewed as categorical data.
# We can also visualize categorical (countable) data. This uses the eng dataframe again from above.
ggplot(eng, aes(x=AgeSubject)) +
geom_bar()
# Well this was boring. Let's see what letters are used in the words that make up the stimuli in the reaction time data. This bit of code splits the words up and counts them:
lets = eng$Word %>% as.character() %>% strsplit("") %>% unlist() %>% table() %>% data.frame()
ggplot(lets, aes(x=reorder(., Freq), y=Freq)) +
geom_bar(stat="identity") +
xlab("letters") +
theme_bw()
library(wordcloud)
## Loading required package: RColorBrewer
library(magrittr)
library(quanteda)
library(reshape2)
# Let's create an object with a bunch of text:
sometext = "In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort. It had a perfectly round door like a porthole, painted green, with a shiny yellow brass knob in the exact middle. The door opened on to a tube-shaped hall like a tunnel: a very comfortable tunnel without smoke, with panelled walls, and floors tiled and carpeted, provided with polished chairs, and lots and lots of pegs for hats and coats—the hobbit was fond of visitors. The tunnel wound on and on, going fairly but not quite straight into the side of the hill — The Hill, as all the people for many miles round called it — and many little round doors opened out of it, first on one side and then on another. No going upstairs for the hobbit: bedrooms, bathrooms, cellars, pantries (lots of these), wardrobes (he had whole rooms devoted to clothes), kitchens, dining-rooms, all were on the same floor, and indeed on the same passage. The best rooms were all on the left-hand side (going in), for these were the only ones to have windows, deep-set round windows looking over his garden, and meadows beyond, sloping down to the river."
# Now let's do some very basic preprocessing to be able to work with the words in the text:
words = gsub("[[:punct:]]", "", sometext) %>% # remove punctuation
tolower() %>% # make everything lowercase
strsplit(., split=" ") %>% unlist() # tokenize; the unlist is due to strsplit's default list output
# Inspect the object we just created. It should be a vector of 236 words.
# Quick magrittr Exercise: rewrite the following lines as a single command with %>%
x = grep("hobbit", words)
n = length(x)
txt = paste("Hobbits are mentioned", n, "times.")
print(txt, quote=F)
## [1] Hobbits are mentioned 4 times.
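One possible piped solution is shown below (peek only after trying it yourself). The `.` is magrittr's placeholder for the piped-in value; using it explicitly means the value is inserted where the dot is, instead of as the first argument:

```r
library(magrittr)
grep("hobbit", words) %>%  # indices of matching words
  length() %>%             # count the matches
  paste("Hobbits are mentioned", ., "times.") %>% # insert the count into the message
  print(quote=F)
```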
# Some ways to inspect and visualize textual data
sortedwords = table(words) %>% sort(decreasing = T) # counts the words and sorts them
# Exercise: have a look at the data using the head() and tail() commands
sortedwords %>% # take the object
.[1:30] %>% # use top 30 only (it's sorted already)
melt(value.name = "count") %>% # melt it into a ggplot-friendly dataframe
ggplot(aes(x=words, y=count) ) + # feed the result as data to ggplot
geom_bar(stat="identity") + # barplot of the counts
coord_flip() + # horizontal is probably easier to read
theme_gray()
# Time to use the quanteda package we loaded earlier.
# We can use it for all the preprocessing as well as the wordclouds:
parsed = dfm(sometext,
remove = stopwords('english'),
remove_punct = TRUE,
stem = FALSE)
parsed[,1:10] # quick look at the new data structure
## Document-feature matrix of: 1 document, 10 features (0.0% sparse).
## 1 x 10 sparse Matrix of class "dfm"
## features
## docs hole ground lived hobbit nasty dirty wet filled ends worms
## text1 3 1 1 3 1 1 1 1 1 1
textplot_wordcloud(parsed, min_count = 1, color=terrain.colors(100))
# Exercise: try setting stemming to TRUE and see how that changes the picture.
# once you are done with this part, execute this to clear the plotting area parameters:
dev.off()
## null device
## 1
We’ll keep using ggplot, but try something new for a change: we’ll look at several ways of visualizing distributions, and at how visualization choices can lead to different and sometimes unintended interpretations.
library(cowplot)
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
##
## ggsave
library(ggplot2)
# A note on geom_smooth(), the ggplot2 "smoothed conditional means" function - it attempts to fit a model to the data, by default either a Loess or GAM curve. While this is a convenient function in itself, it should be used only if one understands how these regression methods work and what their interpretation is - particularly that of Loess, which is often misused.
d=data.frame(time=1:40, value=c(rlnorm(39,2,0.2),20))
plot_grid(
ggplot(d , aes(x=time, y=value)) + geom_point() + geom_smooth(method = "loess", span=0.2) + labs(subtitle = "loess, 0.2"),
ggplot(d , aes(x=time, y=value)) + geom_point() + geom_smooth(method = "loess", span=1) + labs(subtitle = "loess, 1"),
ggplot(d , aes(x=time, y=value)) + geom_point() + geom_smooth(method = "lm") + labs(subtitle = "lm"),
nrow = 3
)
library(ggplot2)
library(cowplot) # provides plot_grid()
library(ggbeeswarm) # an additional geom
set.seed(1); x2=round(rnorm(400,35,10))+30; x1=round(rnorm(1000,35,10)) # never mind the random data creation for now, just run this line, and then focus on the plotting code below:
# Poll: how likely is it that these are samples from the same distribution/population, or are on average similar?
plot_grid(
ggplot() + aes(x1) + geom_bar(width=1) + theme_gray(base_size=8)+labs(title="Are these samples likely \ndrawn from the same population?"),
ggplot() + aes(x2) + geom_bar(width=1) + theme_gray(base_size=8)+labs(title="\n")
)
options(scipen=999)
ks.test(x1,x2)
## Warning in ks.test(x1, x2): p-value will be approximate in the presence of
## ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x1 and x2
## D = 0.871, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
# Step 2: lims(x, y)
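The "Step 2" hint above presumably refers to giving both bar plots a common x-axis range, so the two samples become visually comparable. A sketch (the limits are arbitrary values chosen to cover both samples):

```r
plot_grid(
  ggplot() + aes(x1) + geom_bar(width=1) + lims(x=c(0,110)) + theme_gray(base_size=8),
  ggplot() + aes(x2) + geom_bar(width=1) + lims(x=c(0,110)) + theme_gray(base_size=8)
)
```

With shared limits, it becomes apparent that the two distributions barely overlap.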
# Visualizing distributions with different methods.
set.seed(5);x=c(runif(50,1,160), rnorm(100,60,10), rnorm(100,100,10)) # some more random data, just run it
# Question: is this variable ~normally distributed? (same data, just two different views)
plot_grid(
ggplot() + aes(x) + geom_histogram(binwidth = 23),
ggplot() + aes(x) + geom_density(adjust = 2) + geom_rug(color=rgb(0,0,0,0.2))
)
# Step 2: binwidth, adjust
# Here's another look at the same data:
plot_grid(align = "h",
ggplot() + geom_boxplot(aes(x=0,y=x),width=0.7) + xlim(-1,1) + labs(x="",y="")+theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()),
ggplot() + aes(0,(x))+ geom_bar(stat = "summary", fun.y = "mean") + stat_summary(geom = "errorbar", fun.data = mean_se, position = "dodge", width=0.2) + coord_cartesian(c(-1,1), c(1,150))+ labs(x="",y="")+theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()),
ggplot() + aes(0,x) + geom_violin(adjust=1) + geom_point(shape=95, size=3, color=rgb(0,0,0,0.2))+ labs(x="",y="")+theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()),
ggplot() + geom_beeswarm(aes(0,x))+ labs(x="same data, different plots",y="")+theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
)
# A recent popular innovation is the "raincloud" plot, a combination of the density or violin plot, the boxplot, and actual points.
library(ggstance) # this provides a horizontal boxplot geom for creating a raincloud plot; an official raincloud package should be in the works.
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
ggplot() + aes(x = x) +
geom_density(position = position_nudge(y = 0.0025), alpha = .1, fill="blue") +
geom_point(aes(y=0), position = position_jitter(height = 0.002),
size=0.7, alpha = 0.3, color="blue", shape=1) +
geom_boxploth(aes(y=0), width=0.001, alpha=0) +
theme_minimal() +
theme(axis.text.y = element_blank(), axis.title = element_blank()) +
expand_limits(y = c(0, 0.03))
# About axes. Which of these three variables (y1, y2, y3) is experiencing the most drastic change over time?
set.seed(1); d=data.frame(y=sort(runif(10,3,4))*runif(10, 0.8,1.2), time=1:10)
plot_grid(ncol=3,
ggplot(d) + aes(x=time, y=y ) + geom_line(col="red", size=1.5) + ylab("series 1") +labs(title="") ,
ggplot(d) + aes(x=time, y=y ) + geom_line(col="orange",size=1.5) +ylim(0,5) + ylab("series 2") +
labs(title="Which series depicts the most drastic change over time?"),
ggplot(d) + aes(x=time, y=y ) + geom_line(col="darkblue", size=1.5) +ylim(0,20) + ylab("series 3") +labs(title="")
)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# plotly can be used to create the same sorts of plots as you've done with the base plot() and the ggplot() function, except interactive. Let's create an interactive time series plot.
# We'll reuse the findmentions() function and the tok object from the earlier section; they are (re)defined below so this part runs on its own:
tok = tokens_tolower(tokens(data_corpus_inaugural$documents$texts)); names(tok)=data_corpus_inaugural$documents$Year
findmentions = function(textlist, targets){
results = data.frame(term=NULL, freq=NULL, year=NULL)
for(i in seq_along(targets)){ # loops over targets
# this applies the grep function to the list of texts to find and count mentions:
freq = sapply(textlist, function(x) length(grep(targets[i], x))/length(x)*1000 )
term = gsub("^(.....).*", "\\1", targets[i])
results = rbind(results, data.frame(term=rep(term, length(textlist)), freq, year=as.numeric(names(textlist))))
}
return(results)
}
freqs = findmentions(textlist=tok,
targets=c("^(he|him|m[ea]n|boys*|male|mr|sirs*|gentlem[ae]n)$",
"^(she|her|wom[ea]n|girls*|female|mrs|miss|lad(y|ies))$")
)
plot_ly(freqs, x=~year, y=~freq, type="scatter", mode="lines", split=~term) %>%
layout(yaxis=list(title="Frequency per 1000 tokens"))
# Note the different syntax. There are pipes instead of +, options like layout are organized in lists, and the split parameter defines groups (like color/group in ggplot2).
# Explore how the interactivity works in the new plot.
# But here's something interesting. Let's recreate the ggplot() version from earlier, but this time save it as an object
gp = ggplot(freqs, aes(x=year, y=freq, color=term)) +
geom_line() + geom_point() +
labs(y="Frequency per 1000 tokens")
gp # call it to have a look
# Now run this:
ggplotly(gp) # ggplot->plotly converter
# Let's try one of the reaction time plots:
gp = ggplot(eng, aes(x=WrittenFrequency, y=Familiarity, col=AgeSubject)) +
geom_point() + theme_gray()
gp # have a look at what that was
ggplotly(gp) # magic
Exercise. Make an interactive map. Copy the map you made above, of the UK places, which combined the polygon and points geoms. Follow the same steps as the other conversions here: assign the ggplot to an object, then run ggplotly on that object. Tip: you can add an additional “text” value to the plot’s aes() - anything specified there will be added to the hover labels in plotly, try e.g. text=name.
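A sketch of what the exercise result might look like. The object and column names here (ukmap, places, lon, lat, name) are placeholders for whatever you called your map data in the earlier section; substitute your own:

```r
# hypothetical object/column names - use the ones from your own map code
gmap = ggplot() +
  geom_polygon(data=ukmap, aes(x=long, y=lat, group=group), fill="gray90") +
  geom_point(data=places, aes(x=lon, y=lat, text=name), color="darkred")
ggplotly(gmap) # converts to an interactive map; hover labels will include the name values
```

(ggplot itself will warn about the unknown text aesthetic; that is expected - it is only used by plotly.)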
library(plotly)
# Here's a plot similar to what we've seen before:
plot_ly(data=eng, x=~Familiarity, y=~RTlexdec, type="scatter", mode="markers", color=~AgeSubject)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
# Discuss the interpretation of the plot with your neighbor.
It might be useful to see how these two variables interact with some third variable of interest though. Exercise: make a copy of the code from above and carry out the following changes, inspecting the plot after every step.
- add text=~Word, name="" - the first adds words to the labels, the second removes the useless trace label
- use english (the whole dataset, instead of the subset we’ve been using)

Bonus: here’s something completely useless, but maybe pretty:
# remember the RGB color plots from earlier, the ones with the black background?
col3 = data.frame(red=runif(1000),green=runif(1000),blue=runif(1000))
plot_ly(col3, x=~red,y=~green, z=~blue, type="scatter3d",mode="markers", marker=list(opacity=0.9),
color=I(apply(col3,1, function(x) rgb(x[1],x[2], x[3])))) %>% layout(paper_bgcolor='black') %>%
config(displayModeBar = F)
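Going back to the scatterplot exercise above, one possible end result with the suggested changes folded in (a sketch; do try the steps one at a time first):

```r
# the full english dataset, with word labels added and the trace label removed
plot_ly(data=english, x=~Familiarity, y=~RTlexdec, type="scatter", mode="markers",
        color=~AgeSubject, text=~Word, name="")
```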
We’ll create a modified subset of the english dataset to produce some artificial language change data. The scenario: 10 words, over 100 years, observing the interplay of their homonymy and frequency values.
{ # just run this to create the semi-artificial dataset
eng2 = english[order(english$NumberSimplexSynsets*runif(nrow(english),0.9,1.1)),
c("WrittenFrequency", "NumberSimplexSynsets")] %>% .[seq(2001, 4000, 2),]
eng2$NumberSimplexSynsets = eng2$NumberSimplexSynsets *
rep(seq(0.8,1.2,length.out=10),100) *runif(100,0.9,1.1)
eng2$year = rep(seq(1800,1899,1),each=10)
eng2$word = as.factor(rep(1:10, 100))
}
# inspect the dataset first
# Plot the change over time:
plot_ly(eng2, x=~NumberSimplexSynsets, y=~WrittenFrequency,
type = 'scatter', mode = 'markers',
frame=~year, # the frame argument activates the animation functionality
color=~word, colors=brewer.pal(10,"Set3"), size=~WrittenFrequency,
marker=list(opacity=0.8)) %>%
layout(showlegend = FALSE) %>%
animation_opts(frame = 800, transition = 795) %>%
config(displayModeBar = F)
# Exercise: change frame and transition speed parameters to something different.
In the following examples, we’ll use the inaugural speeches of US presidents again. We’ll start by looking into which presidents mention or address other presidents in their speeches. We’ll extract the mentions programmatically rather than hand-coding them.
library(quanteda) # make sure this is loaded
library(igraph)
library(visNetwork)
speeches = gsub("Washington DC", "DC", data_corpus_inaugural$documents$texts) # replace city name to avoid confusion with the president Washington (hopefully)
speechgivers = data_corpus_inaugural$documents$President # names of presidents giving the speech
presidents = unique(data_corpus_inaugural$documents$President) # presidents (some were elected more than once)
# The following piece of code looks for names of presidents in the speeches using grep(). Just run this little block:
{
mentions = matrix(0, ncol=length(presidents), nrow=length(presidents),
dimnames=list(presidents, presidents))
for(president in presidents){
foundmentions = grep(president, speeches)
mentions[speechgivers[foundmentions], president ] = 1
}
}
# Note: this is not perfect - the code above conflates mentions across multiple speeches by the same re-elected president, "Bush" as well as "Roosevelt" refer to multiple people, and other presidents might share names with other people as well. You can check the context of keywords using quanteda's kwic() command:
kwic(data_corpus_inaugural, "Monroe")
## 
## [1885-Cleveland, 1202] It is the policy of | Monroe | and of Washington and Jefferson
## [1909-Taft, 1784] bears the name of President | Monroe | . Our fortifications are yet
## [1925-Coolidge, 494] , and secured by the | Monroe | doctrine. The narrow fringe
#
# Have a look at the data
mentions[30:35, 30:35] # rows: one mentioning; columns: being mentioned
## Carter Reagan Bush Clinton Obama Trump
## Carter 0 0 0 0 0 0
## Reagan 0 0 1 0 0 0
## Bush 1 1 1 1 0 0
## Clinton 0 0 1 0 0 0
## Obama 0 0 1 0 0 0
## Trump 1 0 1 1 1 0
counts = data.frame(names=colnames(mentions), count=apply(mentions, 2, sum))
ggplot(counts) +
geom_col(aes(y=count, x=names), fill= brewer.pal(3,"Set2")[1]) +
coord_flip() +
scale_x_discrete(limits = counts$names) +
theme_dark()
pgraph = graph_from_adjacency_matrix(mentions, mode="directed") # this uses igraph again
# you can have a look at the basic igraph plot if you want
# this uses visNetwork:
v = toVisNetworkData(pgraph)
visNetwork(nodes = v$nodes, edges = v$edges)
# check how it looks before we add all the fancy stuff
# Exercise: now use pipe %>% notation and the following functions to adjust the visNetwork plot (i.e., visNetwork(..) %>% visNodes(..) etc). See how the graph changes after each addition. Feel free to play around with the parameters!
# visNodes(size = 10, shadow=T, font=list(size = 30))
# visIgraphLayout("layout_in_circle", smooth=T) # steal a better layout from igraph
# visEdges(arrows = "to", shadow=T, smooth=list(type="discrete"), selectionWidth=5)
# visOptions(highlightNearest = list(enabled = T, hover = T, degree=1, labelOnly=F, algorithm="hierarchical"), nodesIdSelection = T) # interactive selection options
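Put together as one piped call, the exercise components from the comments above might look like this (parameters copied from the comments; adjust to taste):

```r
visNetwork(nodes = v$nodes, edges = v$edges) %>%
  visNodes(size = 10, shadow=T, font=list(size = 30)) %>%
  visIgraphLayout("layout_in_circle", smooth=T) %>%
  visEdges(arrows = "to", shadow=T, smooth=list(type="discrete"), selectionWidth=5) %>%
  visOptions(highlightNearest = list(enabled = T, hover = T, degree=1, labelOnly=F,
                                     algorithm="hierarchical"), nodesIdSelection = T)
```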
# Finally, click "Export" under the Viewer tab, and select "Save as webpage".
While we’re at it, let’s try to probe into the contents of the speeches and use some more interactive plotting tools to visualize it.
library(quanteda, warn.conflicts=F) # this needs to be loaded
library(plotly, warn.conflicts=F) # this is new
# This block of code will extract the top terms (after removing stopwords) from the speeches, calculate the distance between the speeches based on word usage, and compress it all into 2 dimensions.
termmat = dfm(data_corpus_inaugural, tolower = T, stem=F, remove=stopwords("english"), remove_punct=T)
topterms = lapply(topfeatures(termmat, n=10, groups=rownames(termmat)), names)
distmat = 1-textstat_simil(termmat, method="cosine") # calculate distances
mds = as.data.frame(cmdscale(distmat,k = 2)) # multidimensional scaling (reduces the distance matrix to 2 dimensions)
# have a look at the object using head()
mds$tags = paste(names(topterms), sapply(topterms, paste, collapse="<br>"), sep="<br>") # add top word labels to the data
mds$Year = data_corpus_inaugural$documents$Year # add the years to the new dataset for ease of use
# Exercise. The following makes use of the plotly package. Create a plot out of the following components.
a = list(x=mds[55:58,1], y=mds[55:58,2], text=rownames(mds)[55:58], ax = -20, ay = 30, showarrow = T, arrowhead = 0) # this is a list with named elements that will be used to add some custom annotations; just run this line.
plot_ly(data = mds, x=~V1, y=~V2,
type="scatter", mode = 'markers',
hoverinfo = 'text', text=~tags
)
# this is the main plotly function - note the somewhat different usage of ~ here to specify variable names
# Exercises:
# add the following parameters to the function call above to color speeches by time: color=~Year
# pipe this in the end as well if you'd rather hide the color legend: %>% hide_colorbar()
# add annotations, use %>% layout(annotations = a )
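Combining the exercise steps above into a single call might look like this (a sketch):

```r
plot_ly(data = mds, x=~V1, y=~V2,
        type="scatter", mode = 'markers',
        hoverinfo = 'text', text=~tags,
        color=~Year) %>%         # color speeches by time
  hide_colorbar() %>%            # hide the color legend
  layout(annotations = a)        # add the custom annotations defined above
```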
# A look into the usage of some words across centuries
termmat_prop = dfm(data_corpus_inaugural,
tolower = T, stem=F,
remove=stopwords("english"),
remove_punct=T
) %>%
dfm_weight("prop") # use normalized frequencies
words = c("america", "states", "dream", "hope", "business", "peace", "war", "terror")
newmat = as.matrix(termmat_prop[,words]) %>% round(5)
plot_ly(x=words, y=rownames(termmat_prop), z=newmat, type="heatmap",
colors = colorRamp(c("white", "orange", "darkred")), showscale = F)
# Exercise (easy). Choose some other words! Also try changing the color palette (the function used here, colorRamp, takes a vector of colors as input and creates a custom palette).
# Add a nice background using %>% layout(margin = list(l=130, b=50), paper_bgcolor=rgb(0.99, 0.98, 0.97))
# Discuss what you see on the plot with your neighbor.
# Exercise (a bit harder). We could get a better picture of what has been said by the presidents if we expanded our word search with regular expressions (^ stands for the beginning of a string and $ for the end, and . stands for any character, so ^white$ would match "white" but not "whites", and l.rd would match "lord" but also "lard" etc). Define some new search terms; below are some ideas.
words2 = c("america$", "^nation", "^happ", "immigra", "arm[yi]", "^[0-9,.]*$")
# The bit of code below uses grep() to match column names, so unless word boundaries are defined using ^$, any column name that *contains* the search string is also matched ("nation" would match "international"). For each search term, it finds all matching columns and sums them row-wise (i.e. per speech).
newmat = sapply(words2, function(x) rowSums(termmat_prop[, grep(x, colnames(termmat_prop)) ])) %>% round(5)
# You can check which column names would be matched with:
grep("america", colnames(termmat_prop), value=T)
## [1] "american" "america" "americanism" "americans" "americas"
## [6] "america's" "american's"
# Then copy the plotly command from above and substitute the z parameter value with newmat.
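That is, the earlier heatmap call, now with words2 on the x-axis and the new matrix as z:

```r
plot_ly(x=words2, y=rownames(termmat_prop), z=newmat, type="heatmap",
        colors = colorRamp(c("white", "orange", "darkred")), showscale = F)
```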
It’s fairly straightforward to produce slides (websites, posters, books) in R using R Markdown, and export into html, pdf, or Word docx. We’ll need to create a new file for this part.
Exercise. Click on the icon with the green plus on a white background in the top left corner, choose “R Markdown…”, then “Presentation”, and then “Slidy”. Slidy is a basic, simple to use slide deck template (by the way, if you are willing to fiddle a bit with CSS, I’d recommend using the xaringan package instead, or if you’re really adventurous, slidify with impressjs).
Change the title to anything you want, and add author: your name into the YAML header on top. Now copy this code block (the entire block, starting with the ``` ) and use it to replace the short code block in the new file where it says “Slide with Plot”. Then click “Knit” (next to the little ball of yarn icon) on the top toolbar. RStudio will ask you to save the new file, just save it anywhere.
An important note on data: when producing an html file from an R Markdown rmd file, functions and objects in the current global environment cannot be accessed. That means that if you’re using a dataset from a package (like we’ve been doing), you’d need to load that package (i.e. include a library(package) call in a code block); if you’re using your own data, you need to include code to import it. It often makes sense to deal with data processing in a separate script, save the results as an .RData file, and then just load the RData (using load("file.RData")) in the markdown file you intend to knit, instead of redoing the data cleaning and analysis every time you re-knit.
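A minimal sketch of that workflow (the file names are placeholders):

```r
# In a separate preprocessing script (run once, not knitted):
# mydata = read.csv("raw_data.csv")      # import the raw data
# mydata = na.omit(mydata)               # ...cleaning steps...
# save(mydata, file = "cleaned.RData")   # store the cleaned result

# Then, in a code block of the .Rmd you knit:
# load("cleaned.RData")  # restores the mydata object into the knitting environment
```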
A couple of examples of things that I’ve myself used R Markdown for:
Before we finish, a word on R and its packages. It’s all free open-source software, meaning countless people have invested a lot of their own time into making this possible. If you use R, do cite it in your work (use the handy citation() command in R to get an up to date reference, both in plain text and BibTeX). To cite a package, use citation("package name"). You are also absolutely welcome to use any piece of code from this workshop, but in that case I would likewise appreciate a reference:
Karjus, Andres (2018). aRt of the Figure. GitHub repository, https://github.com/andreskarjus/artofthefigure. Bibtex:
@misc{karjus_artofthefigure_2018, author = {Karjus, Andres}, title = {aRt of the Figure}, year = {2018}, publisher = {GitHub}, journal = {GitHub repository}, howpublished = {\url{https://github.com/andreskarjus/artofthefigure}}, DOI = {10.5281/zenodo.1213335} }
That’s it for today. Do play around with these things later when you have time, and look into the bonus sections for extras. If you get stuck, Google is your friend; also, check out www.stackoverflow.com - this site is a goldmine of programming (including R) questions and solutions.
Also, if you are looking for consulting on data analysis and visualization or more workshops, take a look at my website https://andreskarjus.github.io/ . I am available for booking via the Edinburgh Uni PPLS Writing Centre (this service is for PPLS students only though) and sometimes hold workshops on these topics. If you want to stay updated keep an eye on my Twitter @AndresKarjus.
But wait! There’s one more thing to do. Since this too is an R Markdown document, we can “knit” it into a nice HTML (or PDF, or Word) report file - it will show both the code and the plots produced by the code. Note that unfortunately this will not work if you have errors in your code - marked by the little red x signs on the left side vertical bar. To knit, click the Knit button (with the little blue ball of yarn) above the script window. If the code is without errors, an HTML document will appear.
Once you get around to working with your own data, you’ll need to import it into R to be able to make plots based on it. There are a number of ways of doing that; but also datasets and corpora come in different formats, so unfortunately there’s no single magic solution to import everything, you usually need to figure out the format of the data beforehand. Below are some examples.
This is probably the most common use case. If your data is in an Excel file format (.xls, .xlsx), you are better off saving it as a plain text file (although there are packages to import directly from these formats, as well as from SPSS .sav files). The commands for that are read.table(), read.csv() and read.delim(). They basically all do the same thing, but differ in their default settings. For very large datasets or corpora, you might want to look into data.table instead.
# an example use case with parameters explained
mydata = read.table(file="path/to/my/file.txt", # full file path as a string
header=T, # if the first row contains column names
row.names=1, # if the 1st (or other) column is row names
sep="\t", # what character separates columns in the text file*
quote="", # if there are " or ' marks in any columns, set this to ""
)
# * "\t" is for tab (default if you save a text file from Excel), "," for csv, " " if space-separated, etc
# for more and to check the defaults, see help(read.table)
# the path can be just a file name, if the file is in the working (R's "default") directory; use getwd() to check where that is, and setwd("full/path/to/folder") to set it (or you can use RStudio's Files tab, click on More)
# If your file has an encoding other than Latin or UTF-8, specify that using the encoding parameter.
mydata = read.table(file.choose() ) # alternatively: this opens a window to browse for files; specify parameters as appropriate
There is a simple way to import data from the clipboard. While importing from files is generally a better idea (you can always re-run the code and it will find the data itself), sometimes this is handy, like quickly grabbing a little piece of table from Excel. It differs between OSes:
mydata = read.table(file = "clipboard") # in Windows (add parameters as necessary)
mydata = read.table(file = pipe("pbpaste")) # on a Mac (add parameters as necessary)
For text, the readLines() command usually works well enough: its output is a character vector, so if the text file has 10 lines, then readLines() produces a vector of length 10, where each line is an element in that vector (you could use strsplit() or quanteda’s functions to further split it into words). If the text is organized neatly in columns (e.g., like the COHA corpus), however, you might still consider read.table(), but probably with the stringsAsFactors=FALSE parameter (this avoids making long text strings into factors; check out the help file if needed). A corpus may be encoded using XML - there is the xml2 package (an improvement on the older XML package) for that, but watch out for memory leaks if importing and parsing multiple files (this is a known issue).
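For example (the file name is a placeholder):

```r
txt = readLines("path/to/corpus.txt")       # one element per line in the file
words = unlist(strsplit(txt, split = " "))  # rough whitespace tokenization
```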
RStudio has handy options to export plots - click on Export on top of the plot panel, and choose the output format. Plots can be exported using R code as well - this is in fact a better approach, since otherwise you would have to click through the Export menus again every time you change your plot and need to re-export. Look into the help files of the jpeg() and pdf() functions to see how this works. ggplot2 has a handy ggsave() function. Interactive plots can be either included in R Markdown based html files, or exported as separate html files (which you can then upload as such, integrate into a website, or plug it in using an iframe).
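A minimal sketch of code-based export (file names are placeholders):

```r
pdf("myplot.pdf", width = 6, height = 4)  # open a pdf graphics device
plot(1:10)                                # anything plotted now goes into the file
dev.off()                                 # close the device to finish writing the file

# the ggplot2 way, saving the last plot (or pass one explicitly with plot=):
# ggsave("myplot.png", width = 6, height = 4)
```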
There are also packages to import and manipulate images, text, GIS map data, relational databases, data from all sorts of other file formats (like XML, HTML, Google Sheets), scrape websites, do OCR on scanned documents, and much more. Just google around a bit and you’ll surely find what you need.
Social networks
The following example will look into plotting social networks of who knows who.
Let’s try something else. Using the same graph data, we’ll recreate it using another package, visNetwork, which makes graphs interactive (note that there are also other network packages, such as networkD3 and ggraph for ggplot2).